(from de Jonge & van der Loo)
Consistent data is technically correct data that is ready for statistical analysis.
This implies that data with missing values, special values, errors and outliers are either:
The data is consistent with constraints based on real-world knowledge about the subject that the data describe.
Consistency can be understood to include:
We will mainly focus on methods dealing with in-record consistency, with the exception of outlier handling, which can be considered a cross-record consistency issue.
The process towards consistent data always involves the following 3 steps:
The execution of these three steps won't always be neatly divided into separate code runs; they can all happen in a single one. However, it is important to keep the process towards consistent data in mind.
This is the first step towards consistency: detecting and localizing errors.
A missing value, represented by np.nan in Pandas, is a placeholder for a datum whose type is known but whose value isn't (technically correct but not consistent).
Therefore, it is impossible to perform statistical analysis on data where one or more values are missing.
We have two options:
Missingness is something to be dealt with prior to any analysis.
In practice, analysts, but also commonly used numerical software, may confuse a missing value with a default value or category.
In Pandas, however, the documentation of each method explains how NAs are treated.
In [7]:
import numpy as np
import pandas as pd
s1 = pd.Series([0.9,0.6,np.nan,0.12])
s2 = pd.Series([0.9,0.6,0,0.12])
s3 = pd.Series([0.9,0.6,0.12])
display("s1 mean: {}".format(s1.mean()))
display("s2 mean: {}".format(s2.mean()))
display("s3 mean: {}".format(s3.mean()))
In [8]:
display("s1 sum: {}".format(s1.sum()))
display("s2 sum: {}".format(s2.sum()))
display("s3 sum: {}".format(s3.sum()))
In [3]:
np.nan + 2
Out[3]:
It should be up to the analyst to decide how empty values are handled, since a default imputation may yield unexpected or erroneous results for reasons that are hard to trace.
Pandas does not follow this principle consistently, in contrast with R, where most methods give the analyst the option of how to handle NAs.
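For reduction methods specifically, pandas does let the analyst make this choice explicitly via the skipna argument; a minimal sketch on a small Series like the ones above:
In [ ]:
import numpy as np
import pandas as pd
s = pd.Series([0.9, 0.6, np.nan, 0.12])
# by default, pandas silently skips the NaN
print(s.mean())
# skipna=False propagates the missing value instead of skipping it
print(s.mean(skipna=False))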
We already learnt how to deal with NaN:
In [12]:
# create a DataFrame with random values
df = pd.DataFrame({"one":np.random.normal(0,1,5),
"two":np.random.normal(0,1,5),
"three":np.random.normal(0,1,5)})
# set some values to NaN
df.iloc[0,0] = np.nan
df.iloc[1,1] = np.nan
df.iloc[2,2] = np.nan
print("-> Modified DF")
display(df)
print("-> Check for nulls")
display(df.isnull().any(axis=1))
print("-> DF without nulls")
display(df.loc[~df.isnull().any(axis=1),:])
In [13]:
#count how many nulls each row has
nulls = df.isnull().sum(axis=1)
print("-> Which rows have nulls")
display(nulls)
print("-> Which columns have nulls")
# to see which columns have NaNs
display(df.isnull().sum())
In [158]:
import matplotlib.pyplot as plt
import numpy as np
s = pd.Series(np.arange(-5,5))
print("Linear space with an inversion")
display(pd.DataFrame({"s": s, "inv_s": 1/s}).head())
display(pd.DataFrame({"s": s, "inv_s": 1/s}).tail())
%matplotlib inline
pd.DataFrame({"s": s, "inv_s": 1/s}).plot("s","inv_s",
kind="scatter")
Out[158]:
In [160]:
print("-> Check where 1/s is finite")
display((1/s)[(1/s).map(lambda x: np.isfinite(x))])
print("-> Check how many finite values we have")
display((1/s).map(lambda x: np.isfinite(x)).sum())
print("-> Check where 1/s is finite")
display((1/s)[(1/s).map(lambda x: np.isinf(x))].count())
We have also seen the pd.NaT value for datetimes and timedeltas. We can deal with these values in the following way:
In [42]:
pd.isnull(pd.NaT)
Out[42]:
In [43]:
print("-> With is null we can check the precence of any of these values")
pd.isnull(pd.Series([pd.NaT,np.nan,None]))
Out[43]:
(from Pandas API Docs)
One has to be careful that in Python (and numpy), NaNs don't compare equal, but Nones do.
Note that Pandas/numpy uses the fact that np.nan != np.nan, and treats None like np.nan.
In [9]:
None == None, np.nan == np.nan
Out[9]:
In [48]:
print("-> check that we can't use == to compare NaNs")
s = pd.Series([None,np.nan])
print((s == None).sum())
print((s == np.nan).sum())
print((s.isnull()).sum())
In general, when we are looking for errors, we may eventually end up with a Boolean Series that marks the positions of the errors. For example, if we want to constrain our variable to be non-negative:
In [162]:
# imagine we have the following data
data = np.random.normal(0,1,1000)
pd.DataFrame(data).hist(bins=100)
Out[162]:
We use a simple comparison, vectorized over the data, to find negative values:
In [163]:
data<0
Out[163]:
Using this result, we can assign an arbitrary value to the negative data points this way:
In [164]:
data[data<0] = np.nan
In [169]:
data[pd.Series(data).isnull()]=100
In [170]:
pd.DataFrame(data).hist(bins=100)
Out[170]:
There are two methods, any() and all(), that can be applied to Boolean Series and DataFrames and are useful for error detection.
Let's see an example to illustrate:
In [171]:
df = pd.DataFrame(
{
"a" : np.random.normal(0,1, 100),
"b" : np.random.normal(6,1, 100)
}
)
df.head()
Out[171]:
In [172]:
(df<0).any()
Out[172]:
In [32]:
(df<0).any(axis=1).head()
Out[32]:
In [33]:
(df>0).all()
Out[33]:
There is a vast body of literature on outlier detection, and several definitions of outlier exist.
A general definition: an outlier in a data set is an observation (or set of observations) which appears to be inconsistent with the rest of that data set.
Note that:
Let's see an example (data taken from here):
In [176]:
import pandas as pd
#load data
hw_df = pd.read_csv("../data/heigths_weights_dummy.tsv",sep="\t",index_col=0)
#check whether there are any nulls
num_nulls = hw_df.isnull().any(axis=1).sum()
if num_nulls > 0:
print("Warning: there are nulls")
hw_df.columns = ["height","weight"]
print("-> Original DF")
display(hw_df.sample(10))
#convert to metric
hw_df["height"] = hw_df["height"] * 2.54
hw_df["weight"] = hw_df["weight"] * 0.453592
#summary statistics
display(hw_df.sample(10))
display(hw_df.describe())
We have a dataset with weights and heights of people. Using describe we can see that:
However, we have two maximum values (for height and weight, possibly not from the same observation) that clearly exceed the rest of the data: 246.76 and 218.80 respectively.
These values are considered outliers.
For more or less unimodal and symmetrically distributed data, Tukey's box-and-whisker method for outlier detection is often appropriate.
In this method, an observation is an outlier when it is larger than the so-called "whiskers" of the set of observations.
The upper whisker is computed by adding 1.5 times the interquartile range (IQR) to the third quartile and rounding to the nearest lower observation.
The lower whisker is computed analogously, subtracting 1.5 times the IQR from the first quartile.
All values above the upper whisker or below the lower whisker are considered outliers.
What is the IQR?
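The IQR is simply the difference between the third and first quartiles. In formulas (a standard statement of Tukey's rule; the rounding-to-an-observation detail is skipped here, as it is in the code below):
$$\mathrm{IQR} = Q_3 - Q_1, \qquad w_{\mathrm{upper}} = Q_3 + 1.5\,\mathrm{IQR}, \qquad w_{\mathrm{lower}} = Q_1 - 1.5\,\mathrm{IQR}$$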
In [67]:
#quartiles split the data into four equal-sized parts
print("-> We can compute the quantiles easily with pandas.DataFrame.quantile()")
#note that the result is a DataFrame
display(hw_df.quantile([.25,.5,.75]))
#as stated, the IQR is the difference between the 3rd and 1st quartiles
#this function will compute the rest
iqr_func = lambda x: x.iloc[1] - x.iloc[0]
#note that we just select the 0.25 and 0.75 quantiles
iqr = hw_df.quantile([.25,.75]).apply(iqr_func)
display("Our IQR is:", iqr)
# add iqr*1.5 to the 3rd quartile and subtract it from the 1st
high_out = hw_df.quantile([.75]) + iqr*1.5
low_out = hw_df.quantile([.25]) - iqr*1.5
print("-> high whiskers:")
display(high_out)
print("-> lower whiskers")
display(low_out)
In [177]:
print("-> Find upper outliers:")
display(hw_df[hw_df["height"] > high_out.loc[0.75,"height"]])
print("-> Find lower outliers:")
display(hw_df[hw_df["height"] < low_out.loc[0.25,"height"]])
print("-> The boxplot show us whiskers and outliers")
hw_df.boxplot(column=["height"],return_type="axes")
Out[177]:
In [75]:
print(hw_df[hw_df["weight"] > high_out.loc[0.75,"weight"]])
print(hw_df[hw_df["weight"] < low_out.loc[0.25,"weight"]] )
hw_df.boxplot(column=["weight"],return_type="axes")
Out[75]:
In [144]:
hw_df.plot(kind="scatter",x="height",y="weight")
Out[144]:
(from https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/#three)
Whenever we come across outliers, the ideal way to tackle them is to find out the reason for these outliers.
The method to deal with them then depends on the reason for their occurrence. Causes of outliers can be classified into two broad categories:
Let’s understand various types of outliers in more detail:
Data Entry Errors: Human errors such as errors caused during data collection, recording, or entry can cause outliers in data. For example: the annual income of a customer is \$100,000. Accidentally, the data entry operator puts an additional zero in the figure. Now the income becomes \$1,000,000, which is 10 times higher. Evidently, this will be the outlier value when compared with the rest of the population.
Measurement Error: This is the most common source of outliers. It occurs when the measurement instrument used turns out to be faulty. For example: there are 10 weighing machines; 9 of them are correct, 1 is faulty. Weights measured by people on the faulty machine will be higher / lower than those of the rest of the people in the group. The weights measured on the faulty machine can lead to outliers.
Experimental Error: Another cause of outliers is experimental error. For example: in a 100m sprint of 7 runners, one runner missed out on concentrating on the ‘Go’ call, which caused him to start late. Hence, his run time was longer than that of the other runners, and his total run time can be an outlier.
Intentional Outlier: This is commonly found in self-reported measures that involve sensitive data. For example: teens typically under-report the amount of alcohol they consume, and only a fraction of them report the actual value. Here the actual values might look like outliers because the rest of the teens are under-reporting their consumption.
Data Processing Error: Whenever we perform data mining, we extract data from multiple sources. It is possible that some manipulation or extraction errors may lead to outliers in the dataset.
Sampling Error: For instance, we have to measure the height of athletes. By mistake, we include a few basketball players in the sample. This inclusion is likely to cause outliers in the dataset.
Natural Outlier: When an outlier is not artificial (due to error), it is a natural outlier. For instance: in my last assignment with one of the renowned insurance companies, I noticed that the performance of the top 50 financial advisors was far higher than that of the rest of the population. Surprisingly, it was not due to any error. Hence, whenever we performed any data mining activity with advisors, we treated this segment separately.
It is also common to get rid of everything above the 95th percentile and below the 5th percentile.
In [178]:
%matplotlib inline
#plot data with histograms
hw_df["height"].hist(bins=hw_df.shape[0])
print(".95 and .5 percentiles")
display(hw_df["height"].quantile([.05,.95]))
In [79]:
#save percentiles in vars
hl = hw_df["height"].quantile([.95])
ll = hw_df["height"].quantile([.05])
#filter and plot
hw_df["height"][(hw_df["height"] < hl.iloc[0]) & (hw_df["height"] > ll.iloc[0])].hist(bins=15)
Out[79]:
In [163]:
hw_df["weight"].hist(bins=hw_df.shape[0])
hw_df["weight"].quantile([.05,.95])
Out[163]:
In [88]:
print("-> Imagine that this dataset has errors")
df1 = pd.DataFrame({"a":[1,2,3,4,5],"b":["a","a","a","b","b"]})
display(df1)
print("-> This one has the correct value for values a and b")
s1 =pd.Series([1,2],index=["a","b"])
display(s1)
#pd.match returns an array with the positions of df1.b's values in s1's index
display(pd.match(df1.b,s1.index))
#we index the correct values in s1 using the result of the match operation
f = s1[pd.match(df1.b,s1.index)]
print("-> but we have a problem with indexes")
display(f)
print("-> we have a to reindex using the original index")
f.index = df1.index
display(f)
print("-> this technique is also useful for converting between diferent units")
display(df1.a * f)
You have probably seen a message saying that pd.match is deprecated.
There are other ways to do this operation, for example using join:
In [100]:
df1 = pd.DataFrame({"a":[1,2,3,4,5],"b":["a","a","a","b","b"]})
s1 =pd.Series([1,2],index=["a","b"],name="correct")
print("-> Original DataFrame")
display(df1)
print("-> Correct Series")
display(s1)
#we set b as the index of df1 so we can join on it
#it is like a SQL join
df1.set_index("b").join(s1, how="inner")
Out[100]:
This technique can be useful, for example, when dealing with different units in the same variable. It is also useful for imputation.
Imagine that, in the Sacramento real estate CSV, somebody tells you that for Residential houses the unit should be square meters instead of square feet.
A fast technique for this transformation is using match.
In [105]:
df = pd.read_csv("../data/Sacramentorealestatetransactions.csv")
#we can create a series with conversion factors this way
# 1. use a default scalar
# 2. as index use the list of unique labels
print("-> labels:")
display(df.type.unique())
#create the series of conversion factors
conv = pd.Series(1.0, index=df.type.unique())
# 3. assign the corrector factor to each label
conv["Residential"] = 0.092903
# we create a new variable which is the factor
conv[pd.match(df.type,conv.index)].sample(10)
Out[105]:
In [30]:
# we assign it to a new variable
new_factor = conv[pd.match(df.type,conv.index)]
# reindex
new_factor.index = df.index
#then multiply
(df.sq__ft * new_factor).head()
Out[30]:
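Since pd.match is deprecated (and removed in recent pandas releases), here is a minimal sketch of the same conversion using Series.map, which looks each label up in conv's index and keeps df's own index, so no reindexing step is needed:
In [ ]:
import pandas as pd
df = pd.read_csv("../data/Sacramentorealestatetransactions.csv")
# same conversion table as above: 1.0 by default, sq ft -> sq m for Residential
conv = pd.Series(1.0, index=df.type.unique())
conv["Residential"] = 0.092903
# map each row's type label to its conversion factor (keeps df's index)
new_factor = df.type.map(conv)
(df.sq__ft * new_factor).head()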
Exercise: Can you do the same with join?
Imputation is the process of estimating or deriving values for fields where data is missing.
There is a vast body of literature on imputation methods and it goes beyond the scope of this course to discuss all of them.
Here, we will use pandas to calculate values to impute and assign them to empty values.
In [111]:
import pandas as pd
import numpy as np
df = pd.DataFrame({"a":[0.8,0.4,1.8,np.nan],
"b":[2.3,np.nan,np.nan,5.6],
"c":[2.4,3.2,1.1,4.5],
"d":[np.nan,4.4,1.8,np.nan],
"tag":["a","a","b","b"]})
print("-> Original DataFrame")
display(df)
print("-> Fill NaNs")
display(df.fillna(0))
A more drastic approach would be to get rid of the observations or variables that contain a NaN.
In [114]:
import pandas as pd
import numpy as np
df = pd.DataFrame({"a":[0.8,0.4,1.8,np.nan],
"b":[2.3,np.nan,np.nan,5.6],
"c":[2.4,3.2,1.1,4.5],
"d":[np.nan,4.+4,1.8,np.nan],
"tag":["a","a","b","b"]})
print("-> Original DataFrame")
display(df)
print("-> Drop rows with nans")
display(df.dropna(axis=0))
print("-> Drop columns with nans")
display(df.dropna(axis=1))
The first method is imputation of the mean:
$$ \hat{x}_i = \bar{x} $$
where $\bar{x}$ is the imputation value: the mean taken over the observed values.
The usability of this model is limited, since it obviously biases measures of spread estimated from the sample after imputation.
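A quick illustration of that bias on a toy Series (values made up for the example): the standard deviation shrinks once the mean is imputed, because the imputed point adds no spread of its own.
In [ ]:
import numpy as np
import pandas as pd
s = pd.Series([1.0, 2.0, 3.0, np.nan, 9.0])
# spread of the observed values only (NaN skipped by default)
print(s.std())
# after mean imputation the standard deviation is smaller
print(s.fillna(s.mean()).std())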
In principle one can use other measures of centrality. For example, imputation of the mean or median can be done as follows.
In [143]:
original_df = pd.DataFrame({"a":[0.8,0.4,1.8,np.nan],
"b":[2.3,np.nan,np.nan,5.6],
"c":[2.4,3.2,1.1,4.5],
"d":[np.nan,4.4,1.8,np.nan],
"tag":["a","a","b","b"]})
#count nulls per row
print("-> see the rows with exactly one nan")
nulls = original_df.isnull().sum(axis=1)
display(original_df[nulls == 1])
print("-> see which columns have nans")
display(original_df.isnull().sum())
print("-> see means")
display(original_df.mean())
#impute means
original_df.fillna(original_df.mean())[nulls == 1]
Out[143]:
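Imputation of the median works the same way; a minimal sketch reusing original_df and nulls from the cell above (restricted to the numeric columns, since the median of tag is not defined):
In [ ]:
num_cols = ["a", "b", "c", "d"]
# fillna with a Series fills column by column; tag is left untouched
original_df.fillna(original_df[num_cols].median())[nulls == 1]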
We have a dichotomous grouping variable (tag, which could stand for something like sex); why don't we impute the mean within each group instead?
In [150]:
original_df.pivot_table(index="tag",values=["a","b","c","d"],aggfunc=np.mean)
Out[150]:
How do we use these group means to fill the missing values?
In [179]:
tag_means = original_df.pivot_table(index="tag",values=["a","b","c","d"],aggfunc=np.mean)
display(tag_means)
print("-> First imputation")
original_df.loc[(original_df.tag=="a") & (original_df.b.isnull()),"b"] = tag_means.loc["a","b"]
display(original_df[nulls == 1])
print("-> Second imputation")
original_df.loc[(original_df.tag=="a") & (original_df.d.isnull()),"d"] = tag_means.loc["a","d"]
display(original_df[nulls == 1])
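A more general sketch of the same idea with groupby/transform, which fills every numeric column with the mean of its own tag group in one step (starting from a fresh copy df2 of the data, since original_df was modified in place above; df2 and group_means are just names for this sketch):
In [ ]:
df2 = pd.DataFrame({"a":[0.8,0.4,1.8,np.nan],
                    "b":[2.3,np.nan,np.nan,5.6],
                    "c":[2.4,3.2,1.1,4.5],
                    "d":[np.nan,4.4,1.8,np.nan],
                    "tag":["a","a","b","b"]})
# per-tag means, broadcast back to the original shape
group_means = df2.groupby("tag")[["a","b","c","d"]].transform("mean")
# fill each NaN with the mean of its own tag group
df2.fillna(group_means)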